20. Normal Distributions

Binomial Distribution

The Binomial Distribution

Recall the many-coin-flips problem

We already saw that, as the number of flips grew larger, the distribution of X started to form a bell-shaped pattern. Let us recollect that for a moment.

4 Flips - Tree

In [1]:
from graphviz import Digraph
from coinflipviz import draw_graph, get_combinations, get_combinations_consolidated, plot_combinations_consolidated

n_flips = 4

g = Digraph()
g = draw_graph(g, n_flips)
g



Out[1]:
[Output: a binary tree for 4 flips; from the root, each node branches to H with probability p and to T with probability q]

All possible sequences

In [2]:
combi_df = get_combinations(n_flips)
combi_df
Out[2]:
sequence x
0 HHHH 4
1 THHH 3
2 HTHH 3
3 TTHH 2
4 HHTH 3
5 THTH 2
6 HTTH 2
7 TTTH 1
8 HHHT 3
9 THHT 2
10 HTHT 2
11 TTHT 1
12 HHTT 2
13 THTT 1
14 HTTT 1
15 TTTT 0

Consolidated number of sequences for each X

In [3]:
final_df = get_combinations_consolidated(n_flips)
final_df
Out[3]:
x n(x) p(x)
0 0 1 0.0625
1 1 4 0.2500
2 2 6 0.3750
3 3 4 0.2500
4 4 1 0.0625

Graphs

In [4]:
plot_combinations_consolidated(final_df)

What is p(X=k)?

Let k = 2; that is, X = 2, the number of heads in a sequence being 2. From the table above (or the LHS graph) we can see there are 6 sequences with k = 2 heads. We already know the total number of sequences is given by $2^4 = 16$.

So,
$ p(X=k) \ = p(X=2) \ = \ \dfrac {\text {No of seq with 2 heads}}{\text {Total no of seq}} \ = \ \dfrac {6}{2^4} = 0.375 \tag{1} $

Via formula

In fact there is a formula hidden in our observation. Counting the number of heads in a sequence is a typical combinatorics problem, and the solution is given by the combination formula. For example, out of the 4 positions in a sequence (one per flip), in how many ways can 2 positions be occupied by heads?

$ \text {No of seq with 2 heads out of 4 placements} \ = \ \dfrac {n!}{k!(n-k)!} \ = \ \dfrac {4!}{2!(4-2)!} \ = \ 6 \tag{2} $

Each of those sequences has a joint probability if individual probabilities are known. Assuming equal probabilities for both head and tail (that is 0.5 each), we could compute total joint probability as follows.

$ \text {Probability of 2 heads and 2 tails in any one sequence} \ = \ p \times p \times \ q \times q \ = \ (0.5) \times (0.5) \times (0.5) \times (0.5) \ = \ (0.5)^2(0.5)^2 \tag{3} $

Combining ${2}$ and ${3}$,
$ \text {Probability of all sequences with 2 heads out of 4 placements} \ = \ 6 \times \ (0.5)^2(0.5)^2 = 0.375 \tag{4} $

This also becomes evident if we rewrite equation ${1}$ as below.

$$ p(X=k) \ = p(X=2) \ = \ \dfrac {6}{2^4} \ = \ 6 \times \Big( \dfrac {1}{2} \Big)^2\Big( \dfrac {1}{2} \Big)^2 \ = \ 6 \times \ (0.5)^2(0.5)^2 = 0.375 $$

Generalizing, for any k we could write

$$ \color {blue} {p(X = k) \ = \ \dfrac {n!}{k!(n-k)!}p^{k}q^{n-k} } \tag{5} $$
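
Equation ${5}$ is easy to sanity check programmatically. Below is a minimal sketch (binom_pmf is a hypothetical helper written here, not part of coinflipviz) that reproduces the p(x) column of the 4-flips table above.

In [ ]:
from math import factorial

def binom_pmf(k, n, p):
    """Evaluate equation (5): n!/(k!(n-k)!) * p^k * q^(n-k)."""
    q = 1 - p
    return factorial(n) // (factorial(k) * factorial(n - k)) * p**k * q**(n - k)

# reproduce the p(x) column of the 4-flips table
for k in range(5):
    print(k, binom_pmf(k, 4, 0.5))  # 0.0625, 0.25, 0.375, 0.25, 0.0625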

Try 10 flips

Let us check with 10 flips. Obviously the tree visualization would be too large, so we will directly compute the graphs and compare.

In [5]:
n_flips = 10
final_df = get_combinations_consolidated(n_flips)
plot_combinations_consolidated(final_df)

Let us check for any k. Say, k = 5,

Graph:

From RHS graph, $ p(X=k) = p(X=5) = 0.24609375 $

Formula:

Plugging in n=10, k=5, p =0.5, q = 1-p = 0.5, in equation ${5}$
$$ p(X = k) \ = \ \dfrac {n!}{k!(n-k)!}p^{k}q^{n-k} \ = \ \dfrac {10!}{5!(10-5)!}(0.5)^{5}(0.5)^{10-5} \\ \\ = \ \dfrac {10 \times 9 \times 8 \times 7 \times 6}{5 \times 4 \times 3 \times 2 \times 1}(0.5)^{5}(0.5)^{5} \\ \\ = \ (252)(0.5)^{10} = 0.24609375 $$
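
The same check can be scripted; a one-liner reusing the hypothetical binom_pmf helper sketched in the 4-flips section:

In [ ]:
print(binom_pmf(5, 10, 0.5))  # 0.24609375, matching both the graph and the hand calculation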

A Corollary - Bernoulli trial and Binomial Distribution

The coin flip problem is a typical Bernoulli trial.

A Bernoulli trial is each repetition of an experiment with only 2 outcomes

We are often interested in the result of independent, repeated Bernoulli trials, i.e. the number of successes in repeated trials.

  1. independent - the result of one trial does not affect the result of another trial.
  2. repeated - conditions are the same for each trial, i.e. p and q remain constant across trials. Hayes refers to this as a stationary process. If p and q can change from trial to trial, the process is nonstationary. The term identically distributed is also often used.

A binomial distribution gives us the probabilities associated with independent, repeated Bernoulli trials. In a binomial distribution the probabilities of interest are those of receiving a certain number of successes, r, in n independent trials each having only two possible outcomes and the same probability, p, of success. So, for example, using a binomial distribution, we can determine the probability of getting 4 heads in 10 coin tosses

Any random variable X with probability function given by

$$ \color {blue} {p(X=k; N,p) \ = \ \dfrac {N!}{k!(N-k)!}p^{k}q^{N-k} \ , \ X = 0,1,2,....N} $$

is said to have a binomial distribution with parameters N and p.

Courtesy: The Binomial Distribution

Maximum Probability

From the 10-flips graphs above, one could easily observe that the maximum number of sequences, and hence the maximum probability, occurs when $k=5$, which is $\dfrac {10}{2} = \dfrac {n}{2}$

Mathematically, when $p=q=0.5$, the function $\dfrac {n!}{k!(n-k)!}p^{k}q^{n-k}$ attains its maximum at $k=\dfrac {n}{2}$ (for even n):

$ \dfrac {n!}{k!(n-k)!}p^{k}q^{n-k}\Bigg\rvert_{k=\frac{n}{2}} \ = \ \dfrac{n!}{\Big[\big(\frac{n}{2}\big)!\Big]^2}\,p^{\frac{n}{2}}\,q^{\frac{n}{2}} $
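
A quick numeric check of this claim, again reusing the hypothetical binom_pmf helper: the pmf over k = 0,...,n should peak exactly at k = n/2 when p = q = 0.5.

In [ ]:
n = 10
pmf = [binom_pmf(k, n, 0.5) for k in range(n + 1)]
print(pmf.index(max(pmf)))  # 5, i.e. n/2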

Mean of binomial distribution

But where is this $\dfrac {n}{2}$ coming from? Well, that is the mean of our binomial distribution coin flip problem. The graphs simply peak at the mean $E(X)$.

Empirical Proof

Let us try to reverse engineer a little bit. Let us consider again no of flips = 4.

In [6]:
n_flips = 4
final_df = get_combinations_consolidated(n_flips)
plot_combinations_consolidated(final_df)

In plain English, from the LHS graph above: 0 heads occurs 1 time, 1 head occurs 4 times, 2 heads 6 times, and so on.

So what is the mean number of heads in this experiment? In other words, the mean value of X. We know from the graph it is 2, but let us find it mathematically, thereby deriving the general formula $E(X)$.

From LHS graph, we know the frequency of X for each possible X value. n(X=0) = 1, n(X=1) = 4 etc.

So by general mean formula applicable to any regular data series,

$$ E(X) = \dfrac {0 \cdot 1 + 1 \cdot 4 + 2 \cdot 6 + 3 \cdot 4 + 4 \cdot 1}{1+4+6+4+1} \\ \\ = \dfrac {0 + 4 + 12 + 12 + 4}{16} = \dfrac {32}{16} = 2 $$

Let us rewrite the derived solution...

$$ \begin{array}{l} E(X) = \dfrac {0 + 4 + 12 + 12 + 4}{16} \\ \\ = \dfrac { \sum\limits_{k=0}^4 X_k \cdot n(X_k) }{2^4} \\ \\ = \sum\limits_{k=0}^4 \dfrac {X_k \cdot n(X_k)}{2^4} \\ \\ \color {blue} {= \sum\limits_{k=0}^4 X_k \cdot p(X_k)} \\ \\ \end{array} $$

Note, $X_k$ is just $X=k$.

We could also expand this to again arrive at our solution, just to clarify how things unfold to and fro.

$ \require{cancel} \begin{array}{l} E(X) = \sum\limits_{k=0}^{4} X_k \cdot p(X_k) \\ \\ = 0 \cdot p(X=0) + 1 \cdot p(X=1) + 2 \cdot p(X=2) + 3 \cdot p(X=3) + 4 \cdot p(X=4) \\ \\ = 0 + 1 \cdot \binom{4}{1}p^{1}(1-p)^{3} + 2 \cdot \binom{4}{2}p^{2}(1-p)^{2} + 3 \cdot \binom{4}{3}p^{3}(1-p)^{1} + 4 \cdot \binom{4}{4}p^{4}(1-p)^0 \\ \\ = 0 + 4 \cdot p^{1}(1-p)^{3} + 12 \cdot p^{2}(1-p)^{2} + 12 \cdot p^{3}(1-p)^{1} + 4 \cdot p^{4} \\ \\ = 0 + 4p(1-p)\Big\{ (1-p)^2 + 3p(1-p) + 3p^2\Big\} + 4p^4 \\ \\ = 0 + 4p(1-p)\Big\{ (1-p)^2 + 3p - \cancel{3p^2} + \cancel{3p^2} \Big\} + 4p^4 \\ \\ = 0 + 4p(1-p)\Big\{ 1 - 2p + p^2 + 3p \Big\} + 4p^4 \\ \\ = 0 + 4p(1-p)\Big\{ 1 + p + p^2 \Big\} + 4p^4 \\ \\ = 0 + 4p \Big\{ 1 + \cancel{p^2} + \cancel{p} - \cancel{p} - p^3 - \cancel{p^2} \Big\} + 4p^4\\ \\ = 0 + 4p \Big\{ 1 - p^3 \Big\} + 4p^4 \\ \\ = 4p \end{array} $

The 4 in the above equation is the number of flips, n.

Proof by Binomial theorem

According to the theorem, it is possible to expand any power of x + y into a sum of the form

$$ (x+y)^{n}=\sum _{k=0}^{n}{n \choose k}x^{k}y^{n-k}. $$
For n = 3, this would be,

$$ \begin{array}{l} (x+y)^{3}=\sum _{k=0}^{3}{3 \choose k}x^{k}y^{3-k} \\ \\ = \binom {3}{0}x^{0}y^{3} + \binom {3}{1}x^{1}y^{2} + \binom {3}{2}x^{2}y^{1} + \binom {3}{3}x^{3}y^{0} \end{array} $$

$$ =y^3 + 3xy^2 + 3x^2y + x^3 \tag{6} $$

Note the coefficients: [1 , 3 , 3 , 1]. These are the binomial coefficients, and could be found in Pascal's triangle at the row corresponding to the value n

[Image: Pascal's triangle]

Suppose q denotes the probability of tails, q = 1 - p. Then we could redo the expansion of $E(X)$ and collapse it using the pattern of equation ${6}$, as below.

$ \begin{array}{l} E(X) = \sum\limits_{k=0}^{4} X_k \cdot p(X_k) \\ \\ = 0 \cdot p(X=0) + 1 \cdot p(X=1) + 2 \cdot p(X=2) + 3 \cdot p(X=3) + 4 \cdot p(X=4) \\ \\ = 0 + 1 \cdot \binom{4}{1}p^{1}q^{3} + 2 \cdot \binom{4}{2}p^{2}q^{2} + 3 \cdot \binom{4}{3}p^{3}q^{1} + 4 \cdot \binom{4}{4}p^{4}q^0 \\ \\ = 0 + 4pq^3 + 12p^{2}q^{2} + 12p^{3}q + 4p^4 \\ \\ =4p\Big\{ q^3 + 3pq^{2} + 3p^{2}q + p^3 \Big\} \\ \\ =4p\Big\{ (p+q)^3 \Big\} \ \ \text{using binomial theorem}\\ \\ \\ \end{array} $
Note, $p+q=1$ $$ \therefore \ \ E(X) = 4p $$

Generalization

Thus in general, we could calculate the mean of a binomial distribution of n flips as follows.

$$ \begin{array}{l} \color {blue} {E(X) = \sum\limits_{k=0}^nX_k \cdot p(X_k) = np} \tag{7} \end{array} $$

We have proven this for a specific number of flips, but it can be proved mathematically for any n with the same approach.
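
As a numeric spot check of equation ${7}$ (once more with the hypothetical binom_pmf helper from earlier), here with an unfair coin to show the result is not tied to p = 0.5:

In [ ]:
n, p = 4, 0.3
EX = sum(k * binom_pmf(k, n, p) for k in range(n + 1))
print(EX, n * p)  # both 1.2 (up to float rounding)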

Variance of binomial distribution

Empirical Proof

Let us revisit the graphs for number of flips = 4.

In [7]:
n_flips = 4
final_df = get_combinations_consolidated(n_flips)
plot_combinations_consolidated(final_df)
In [8]:
final_df
Out[8]:
x n(x) p(x)
0 0 1 0.0625
1 1 4 0.2500
2 2 6 0.3750
3 3 4 0.2500
4 4 1 0.0625

In plain English, from the LHS graph above: 0 heads occurs 1 time, 1 head occurs 4 times, 2 heads 6 times, and so on.

So what is the variance of the number of heads in this experiment? In other words, the variance $\sigma^2$ of X. Let us find it mathematically, using the general formula for $Var(X)$.

From LHS graph, we know the frequency of X for each possible X value. n(X=0) = 1, n(X=1) = 4 etc.

So by general variance formula applicable to any regular data series (given with frequency count of each X),

$$ Var(X) = \dfrac { (0-2)^2(1) + (1-2)^2(4) + (2-2)^2(6) + (3-2)^2(4) + (4-2)^2(1)}{2^4} \\ \\ = \dfrac {4+4+0+4+4}{16} = \dfrac {16}{16} = 1 $$

Let us rewrite the derived solution...

$$ \begin{array}{l} Var(X) = \dfrac {4 + 4 + 0 + 4 + 4}{16} \\ \\ = \sum\limits_{k=0}^4 \dfrac {(X_{k}-\bar{X})^2n(X_k)}{2^4} = \sum\limits_{k=0}^4 (X_{k}-\bar{X})^2p(X_k) \end{array} $$

Thus, $$ \color {blue} {\sigma^2 = Var(X) = \sum\limits_{k=0}^n (X_{k}-\bar{X})^2p(X_k)} \tag {8} $$

Following the same pattern as equation $7$, $$ E(X) = \sum\limits_{k=0}^nX_k \cdot p(X_k) \\ \\ $$ we similarly have $$ E(X^2) = \sum\limits_{k=0}^nX^2_k \cdot p(X_k) \tag {9} $$

Expanding equation $8$,

$$ \begin{array}{l} Var(X) = \sum\limits_{k=0}^n (X_{k}-\bar{X})^2p(X_k) \\ \\ = \sum\limits_{k=0}^n\Big\{ X^2_k-2X_k\bar{X} + \bar{X}^2 \Big\}p(X_k) \\ \\ \\ = \sum\limits_{k=0}^nX^2_kp(X_k) - 2\bar{X}\sum\limits_{k=0}^nX_kp(X_k) + \bar{X}^2\sum\limits_{k=0}^np(X_k) \\ \\ \end{array} $$

$$ \sum\limits_{k=0}^np(X_k) = 1 \ \ \because \ \text {the probabilities over all outcomes total 1} $$

Using that and also equation $9$, $$ Var(X) = E(X^2)-2\bar{X}^2+\bar{X}^2 = E(X^2) - \bar{X}^2 $$

Thus, $$ \color {blue} {\sigma^2 = Var(X) = E(X^2) - [E(X)]^2} \tag {10} $$

This alternate form could be useful to deduce a simplified formula for variance of binomial distribution as follows.

For n=4, we could expand equation $9$ as below.

$$ \begin{array}{l} E(X^2) = \sum\limits_{k=0}^4X^2_kp(X_k) \\ \\ = 0 + 1 \cdot \binom{4}{1}p^1q^3 + 4 \cdot \binom{4}{2}p^2q^2 + 9 \cdot \binom{4}{3}p^3q^1 + 16 \cdot \binom{4}{4}p^4q^0 \\ \\ = 4pq^3 + 24p^2q^2 + 36p^3q + 16p^4 \\ \\ = 4pq \big\{\ q^2 + 6pq+9p^2 \ \} + 16p^4 \\ \\ = 4pq \big\{ q^2 + 2(3p)q + (3p)^2 \big\} + 16p^4 \\ \\ = 4pq(q+3p)^2 + 16p^4 \\ \\ = 4pq(1+2p)^2 + 16p^4 \\ \\ = 4pq(1+4p+4p^2) + 16p^4 \\ \\ = 4pq + 16p^2q + 16p^3q + 16p^4 \\ \\ \\ \text {Var}(X) = E(X^2)-[E(X)]^2 \\ \\ = \big \{ 4pq + 16p^2q + 16p^3q + 16p^4 \big \} - 16p^2 \\ \\ = 4pq + 16p^2(1-p) + 16p^3q + 16p^4 - 16p^2 \\ \\ = 4pq + \cancel{16p^2} - 16p^3 + 16p^3q+16p^4 - \cancel{16p^2} \\ \\ = 4pq + 16p^3q+16p^4-16p^3 \\ \\ = 4pq + 16p^3q + 16p^3(p-1) \\ \\ = 4pq + \cancel{16p^3q} - \cancel{16p^3q} \\ \\ = 4pq \end{array} $$

Generalizing, for a binomial problem of n flips, with probability of success p and p+q=1,

$$ \color {blue} {\sigma^2 = Var(X) = \sum\limits_{k=0}^n (X_{k}-\bar{X})^2p(X_k) = E(X^2) - [E(X)]^2 = npq} \tag {11} $$
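
And a matching spot check of equation ${11}$, in the same spirit (binom_pmf is the assumed helper from earlier):

In [ ]:
n, p = 4, 0.5
q = 1 - p
EX  = sum(k    * binom_pmf(k, n, p) for k in range(n + 1))
EX2 = sum(k**2 * binom_pmf(k, n, p) for k in range(n + 1))
print(EX2 - EX**2, n * p * q)  # both 1.0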

Try 20 flips

Let us check the mean and variance now with 20 flips. Obviously the tree visualization would be too large, so we will directly compute the graphs and compare.

In [9]:
n_flips = 20
final_df = get_combinations_consolidated(n_flips)
plot_combinations_consolidated(final_df, label=False)

We know, n = 20, p = 0.5, q = 0.5

Mean

$ \mu = E(X) = np = (20)(0.5) = 10 $

This is easily verifiable from the above graph: at X=10 we get the maximum n(X) and also the maximum p(X).

Variance

$ \sigma^2 = Var(X) = npq = (20)(0.5)(0.5) = 5 $

Conclusion

We have derived the three main formulas for a binomial distribution problem (coin toss). For n flips, with probability of heads p at each flip (0.5 in our examples) and q = 1 - p, we can calculate

1. Probability of X = k

$$ \color {blue} {p(X=k; n,p) \ = \ \dfrac {n!}{k!(n-k)!}p^{k}q^{n-k} \ , \ X = 0,1,2,....n} $$

2. Mean

$$ \color {blue} {\mu = E(X) = np} $$

3. Variance

$$ \color {blue} {\sigma^2 = Var(X) = npq} $$

Normal Approximation

The Normal Distribution

So far

We have derived the three main formulas for a binomial distribution problem (coin toss). For n flips, with probability of heads p at each flip (0.5 in our examples) and q = 1 - p, we can calculate

1. Probability of X = k

$$ \color {blue} {p(X=k; n,p) \ = \ \dfrac {n!}{k!(n-k)!}p^{k}q^{n-k} \ , \ X = 0,1,2,....n} $$

2. Mean

$$ \color {blue} {\mu = E(X) = np} $$

3. Variance

$$ \color {blue} {\sigma^2 = Var(X) = npq} $$

Introduction

When n grows large, even beyond just 20 flips, it becomes computationally challenging to calculate all the possible combinations because their number grows exponentially. This calls for an approximation for a large number of flips. This is where we approximate with a continuous function called the Normal Distribution.

[Image: binomial distribution converging to the normal curve]

There is a long history behind coming up with such approximations for error curves, and 1 is one interesting article that talks in detail about the normal curve. Obviously we will not go into that detail here, but will try to show how the continuous function fits our error curve.

Visual Proof

In its raw form, a simple probability density error function $\varphi(x)$ looks like below.

$$ {\varphi (x)={\dfrac {1}{\sqrt {2\pi }}}e^{-{\frac {1}{2}}x^{2}}} \tag{1} $$

Let us plot both $\varphi(x)$ and our binomial probability curve together.

In [6]:
from coinflipviz import get_combinations_consolidated

n_flips = 20
final_df = get_combinations_consolidated(n_flips)
In [7]:
from bi_to_nor_demo import plot_bi_nor

plot_bi_nor(final_df, C='1/(sqrt(2*pi))', E='-(X**2)/2')

$\varphi(x)$ looks taller and sits on the left side. At the moment it does not give much hope of being a good approximation, but let us manipulate the curve next.

Shift to right

Note that $\varphi(x)$ is centered at X=0; we need it centered at X=10, which is the mean of our problem. So we just do that.

$$ {\varphi (x)={\dfrac {1}{\sqrt {2\pi }}}e^{-{\frac {1}{2}}(x-\color{blue}{\mu})^{2}}} $$

In [8]:
plot_bi_nor(final_df, mu=10, C='1/(sqrt(2*pi))', E='-((X-mu)**2)/2')  # note (X-mu) term in E denoting the shift

Scale down

Of course $\varphi(x)$ is still taller, so let us scale it down by $\sigma$

$$ {\varphi (x)={\dfrac {1}{\color{blue}{\sigma} \sqrt {2\pi }}}e^{-{\dfrac {1}{2}}(x-\color{blue}{\mu})^{2}}} $$

Remember, our $\sigma^2$ was 5, so $\sigma$ would be 2.236

In [9]:
plot_bi_nor(final_df, sigma=2.236, mu=10, C='1/(sigma*sqrt(2*pi))', E='-((X-mu)**2)/2')  

Stretch it

$\varphi(x)$ is thinner, so still not quite a good approximation. To stretch it, let us divide $(x-\mu)$ in the exponent by $\sigma$. Note we are operating on the X value, so just like $\mu$, the $\sigma$ is applied before squaring.

$$ \varphi (x)={\frac {1}{\color{blue}{\sigma} \sqrt {2\pi }}} e^{-\dfrac {1}{2}\Big( \dfrac {x - \mu}{\color {blue}{\sigma}} \Big)^2} $$

In [10]:
plot_bi_nor(final_df, sigma=2.236, mu=10, C='1/(sigma*sqrt(2*pi))', E='-(((X-mu)/sigma)**2)/2')  

Now, I hope, it is visually satisfactory to accept the normal approximation function $\varphi(x)$ as a replacement for large n.

Mathematical Proof

It can also be proven mathematically that the binomial function is approximated by the normal probability density function we just saw; however, the proof is quite tedious and is also skipped in Udacity. We will try it some time in the future.

For those interested, there are multiple ways to prove it, but I find the De Moivre-Laplace theorem to be the most straightforward starting from our binomial distribution problem.

Conclusion

Thus, a binomial distribution problem with sufficiently large n, such that $np \geq 10$ and $nq \geq 10$, can be approximated by the normal probability density function below, where $\mu$ and $\sigma$ are the mean and standard deviation respectively.

$$ \color {blue} {\varphi (x)={\frac {1}{\sigma \sqrt {2\pi} }} e^{-\dfrac {1}{2}\Big( \dfrac {x - \mu}{\sigma} \Big)^2} \tag {2}} $$

where $ \mu = np \ , \ \sigma = \sqrt{np(1-p)} $

Tips

Continuity Correction

$P(X=k)$ in binomial $\neq$ $p(X=k)$ in normal.
In fact, $p(X=k)$ in the normal case is zero, because it is a continuous function. So always consider taking the closest range.
For example, for $P(X=8)$ in binomial, if you use the normal approximation, you could try $p(7.5<X<8.5)$. This is called continuity correction.

Remember, the binomial distribution is discrete, while the normal is continuous.
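
Below is a minimal sketch of the correction in action, assuming scipy is available: we compare the exact binomial $P(X=8)$ for $n=20, p=0.5$ against the continuity-corrected normal area $p(7.5<X<8.5)$, with $\mu=np=10$ and $\sigma=\sqrt{npq}=\sqrt{5}$.

In [ ]:
from math import sqrt
from scipy.stats import binom, norm

n, p = 20, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))

exact  = binom.pmf(8, n, p)                                    # discrete P(X=8)
approx = norm.cdf(8.5, mu, sigma) - norm.cdf(7.5, mu, sigma)   # continuity-corrected area
print(round(exact, 4), round(approx, 4))  # ~0.1201 vs ~0.1194, a close match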

Sampling Distributions

Sampling Distributions

Sampling Distribution of Sample Proportions

This typically applies to cases where the variable under study is categorical. For example, a population of gumballs where 60% of the balls are yellow.

Gumballs

Setup

Bernoulli:

Let Y be the random variable for picking a gumball out of, say, 10000 balls, which is our population $T$. 60% of those balls are yellow and the rest are other colors.

Our population would thus look like this (1 indicating yellow, 0 indicating other colors):
$[1,0,0,0,1,1,0,1,0,1,1,0,0,0,1, \cdots ,1]$

$ Y = \begin{cases} 1 \ \ \text{if ball picked up is yellow} \\ 0 \ \ \text{if not} \end{cases} $

Then the mean and variance of $Y$ can be calculated as below (Proof here).

$ \mu_y = p \ \ \ \ \sigma_y^2 = pq $
where $p$ is the probability of picking a yellow ball, and $q$ that of other colors.

Suppose, again by some godly means, we know $p = 0.6$. Then

$ \mu_y = p = 0.6 \ \ \ \ \sigma_y^2 = pq = (0.6)(0.4) = 0.24 $

$$ \color {blue}{ \mu = \mu_y = p = 0.6 \\ \sigma^2 = \sigma_y^2 = pq = 0.24 \\ \sigma = \sigma_y = \sqrt{0.24} = 0.4898 } \tag{1} $$

In [70]:
%matplotlib inline

from random import shuffle 
import matplotlib.pyplot as plt
from SDSPSM import get_metrics, drawBarGraph

T = 10000  # total size of population
p = 0.6    # 60% has yellow balls

y_freq = int(p*T)                 
y_pops = [1]*y_freq        
o_freq = int((1-p)*T)
o_pops = [0]*o_freq
population = y_pops + o_pops
population_freq = [o_freq, y_freq]
shuffle(population)    

# population metrics
mu, var, sigma = get_metrics(population)

#visualize
backgroundColour = '#F5F5FF'
fig, (ax1) = plt.subplots(1,1, figsize=(5,3), facecolor=backgroundColour)
drawBarGraph(population_freq, ax1, [T, mu, var, sigma], 'Population Distribution','Gumballs', 'Counts',xmin=0)


ax1.set_facecolor(backgroundColour)
# fig.patch.set_facecolor(backgroundColour)
plt.show()

A rough visualization of the distribution, the density (bars scaled so their total area equals 1), and the probability mass function of our current population

In [72]:
import numpy as np
from math import sqrt, pi

fig, (ax1,ax2,ax3) = plt.subplots(1,3,figsize=(15,4))

def mini_plot_SDSP(raw_list, norm_off=False, width=0.1): 

    ax1.hist(raw_list) 
    ax1.set_title('Distribution')

    _, bins, _ = ax2.hist(raw_list, density=True) 
    ax2.set_title('Density')

    # probability mass function: frequency of each value divided by total count
    dummy_dict = {i: raw_list.count(i) for i in raw_list}
    total = sum(dummy_dict.values())
    pmf = {key: round(val/total, 4) for key, val in dummy_dict.items()}
    ax3.bar(list(pmf.keys()), list(pmf.values()), width=width)
    ax3.set_title('PMF')

    # overlay the normal approximation curve on the density plot, unless turned off
    if not norm_off:
        mu, var, sigma = get_metrics(raw_list)
        X = np.linspace(min(bins), max(bins), 10*len(bins))
        Cp = 1/(sigma*sqrt(2*pi))
        Ep = -1/2*((X-mu)/sigma)**2
        G = Cp*np.exp(Ep)
        metrics_text = '$\mu_x:{}$ \n$\sigma_x:{}$'.format(mu, sigma)
        ax2.text(0.97, 0.98, metrics_text, ha='right', va='top', transform=ax2.transAxes, fontsize=10, color='red')
        ax2.plot(X, G, color='red')


mini_plot_SDSP(population, norm_off=True)
plt.show()

Statistical Outcomes:

Let us randomly sample from the population. One trial/experiment is drawing, say, $n=10$ samples. Likewise, we repeat for many experiments, say $N=2000$.

Remember: We sample WITH REPLACEMENT

Steps:

  • 1.Take a random sample of $n=10$ balls from the population. Example sample output: $\widehat{Y_1} = [1,0,0,1,0, \cdots ,0]$, denoting our 1st sample of size $n=10$. The hat (^) just indicates it's a statistical outcome.
  • 2.Take the mean of it and note down for kth experiment. $\overline{\widehat{Y_k}} = \dfrac {1}{n} \sum\limits_{i=1}^n Y_{ki} = \dfrac {1}{10} \sum\limits_{i=1}^{10} Y_{ki}$. Example sample output: $\overline{\widehat{Y_1}} = 0.7$
  • 3.Do steps 1 and 2 $N=2000$ times. That is, $N=2000$ experiments $\implies$ $2000$ mean values $\overline{\widehat{Y_k}}$, for $k = 1,2,3,\ldots,N$. Note that by now $\overline{\widehat{Y_k}}$ itself looks like a random variable taking on different mean values, but with repetitions, which should give us the intuition that the most frequent value will be the one equal to the population mean. Example total output:
$\widehat{Y_k}$ $\overline{\widehat{Y_k}}$
1 0.6
2 0.9
3 0.6
4 0.4
5 0.1
... ...
2000 0.3
  • 4.Take the frequency count of the sample means and prepare a concise frequency chart, with each row pairing a mean value with its frequency. If the list is small, you might already notice that the highest frequency happens around the population mean. Example total output:
$\overline{\widehat{Y_j}}$   $n(\overline{\widehat{Y_j}})$
0.1 8
0.2 23
0.3 65
... ...
0.9 20
1 5
  • 5.Calculate the probability for each row of the above table by simply dividing the frequency by the total number of outcomes. Note the denominator sums to 2000 because, statistically, we get 1 outcome per experiment (which is a mean), so the total number of occurrences after 2000 experiments is 2000 outcomes (2000 means).
    $ p\Big(\overline{\widehat{Y_j}}\Big) = \dfrac {n\Big(\overline{\widehat{Y_j}}\Big)}{\sum n\Big(\overline{\widehat{Y_j}}\Big)} = \dfrac {n\Big(\overline{\widehat{Y_j}}\Big)}{2000} $
$\overline{\widehat{Y_j}}$   $n(\overline{\widehat{Y_j}})$   $p(\overline{\widehat{Y_j}})$
0.1 8 0.0040
0.2 23 0.0115
0.3 65 0.0325
... ... ...
0.9 20 0.0100
1 5 0.0025

From the above table, one could draw the probability mass function, a discrete probability distribution from which one can read off, for any $k$, the probability $p(X=k)$ discretely.

  • 6.Calculate the mean, variance and SD as below. Suppose there are M rows in total in the table. We would find that,
    $ \mu(\overline{\widehat{Y}}) = \overline{\overline{\widehat{Y}}} = \sum\limits_{j=1}^M \overline{\widehat{Y_j}}p(\overline{\widehat{Y_j}}) \approx \color {blue}{0.6}\\ \\ \sigma(\overline{\widehat{Y}})^2 = \sum\limits_{j=1}^M \Big(\overline{\widehat{Y_j}} - \mu(\overline{\widehat{Y}})\Big)^2p(\overline{\widehat{Y_j}}) \approx \color {blue}{0.0226}\\ \\ \therefore \ \ \sigma(\overline{\widehat{Y}}) \approx \sqrt{0.0226} = \color {blue}{0.1505} \tag {2} $

It's much simpler programmatically:

In [73]:
from random import choices

N = 2000
n = 10

Y_hat = []
Y_mean_list = []
for each_experiment in range(N):  
    Y_hat = choices(population, k=n)  # sample with replacement
    Y_mean = sum(Y_hat)/len(Y_hat)
    Y_mean_list.append(Y_mean)

mu, var, sigma = get_metrics(Y_mean_list)
print(mu, var, sigma)
0.6048 0.024 0.1549

A quick visualization of our resultant statistical outcome Y_mean_list

In [74]:
fig, (ax1,ax2,ax3) = plt.subplots(1,3,figsize=(15,4))

mini_plot_SDSP(Y_mean_list)
plt.show()

Notations and Conditions

You can already see the resultant sampling distribution shaping up as normal. A visual summary is provided below, where the normal distribution approximation is also applied.

Back to variables: the complicated statistical notation is typically written with simpler notation as follows. We used the former because we did not want to lose details for the sake of simplicity. Now that we have the insight, we can swap in these simpler conventional notations.

$$ \color {blue}{ \text{Random Variable} \ \ \widehat{p} = \overline{\widehat{Y}} \\ \\ \mu_\widehat{p} = \mu(\overline{\widehat{Y}}) \\ \\ \sigma_\widehat{p} = \sigma(\overline{\widehat{Y}}) } $$

Comparing equations set $1$ and $2$,

$$ \color {blue} { \mu_\widehat{p} \approx 0.6 = \mu = p \\ \\ \sigma_\widehat{p} \approx 0.1505 \approx \dfrac{0.4898}{\sqrt{10}} = \dfrac {\sigma}{\sqrt{n}} = \sqrt{\dfrac {pq}{n}} \tag {3} } $$

$\widehat{p}$ is called the sample proportion. The resultant sampling distribution can be satisfactorily approximated by a normal density function when the sample size is sufficiently large.
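
As a one-line check of equation ${3}$ against the simulation above (get_metrics printed a simulated $\sigma$ of 0.1549):

In [ ]:
from math import sqrt

p, q, n = 0.6, 0.4, 10
print(round(sqrt(p * q / n), 4))  # 0.1549, the theoretical sigma of the sampling distribution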

Conditions:

How large is sufficiently large? The general rule of thumb is that $np \geq 10$ and $nq \geq 10$ assures a normal shape for the sampling distribution.

Visual Summary

In [75]:
%matplotlib inline
from SDSPSM import plot_SDSP

# PLOT THE RESULTS
from SDSPSM import plot_SDSP, get_normal_curve_label
backgroundColour = '#F5F5FF'
fig, axarr = plt.subplots(2,3, figsize=(15,10), facecolor=backgroundColour)

pop_axarr = [axarr[0,0],axarr[0,1], axarr[0,2]]
titles = ['Population Distribution','Population Discrete\nDensity Function','Population Probability Mass\nFunction (PMF)']
plot_SDSP(population, pop_axarr, titles, index_list=['A','B','C'], norm_off=True)

stat_axarr = [axarr[1,0],axarr[1,1], axarr[1,2]]
titles = ['Statistical Distribution','\nStatistical Discrete Density Function','Statistical Probability Mass\nFunction (PMF)']
plot_SDSP(Y_mean_list, stat_axarr, titles, index_list=['D','E','F'])

plt.tight_layout()
plt.show()

Note that the normal approximation best fits not our discrete probability mass function F, but the discrete density function E. This is because a probability density function has total area 1 (total probability is 1), so the bars it is overlaid on must have total area 1 as well. That holds for the intermediate density function E, but not for our last derived discrete function F.

Examples

Example 1

The safety officers of a mining company take a simple random sample of $500$ employees and find that $15\%$ of the sampled employees wear contact lenses. The safety officers may take several more samples like this. Suppose it is really $12\%$ of the approximately $34{,}000$ employees in the company who wear contact lenses.

What are the mean and standard deviation of the sampling distribution of the proportion of employees who wear contact lenses? Ref: Khan Academy

Ans

The problem at hand is what proportion wears contact lenses. Our random variable indicates whether an employee wears contact lenses or not, thus a Bernoulli one. It's a categorical problem, so we treat it as sample proportions. Since in any Bernoulli distribution $\mu = p$, we proceed as follows.

Given True population parameters: $p = \mu = 0.12$

One Sample result: n=500, $\widehat{p_1} = 0.15$.

Note the officers may take more samples like this. So eventually, with many experiments of 500 samples each, one should arrive at a normal sampling distribution (normality assured because $np = 500 \times 0.12 = 60 \geq 10$ and $nq = 440 \geq 10$).

So resultant normal sampling distribution will have below parameters.

$$ \mu_\widehat{p} = \mu = 0.12 \\ \sigma_\widehat{p} = \sqrt{\dfrac{pq}{n}} = \sqrt{\dfrac{0.12(1-0.12)}{500}} $$
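
Numerically, a small sketch (plain Python; the inputs are just the given $p = 0.12$ and $n = 500$):

In [ ]:
from math import sqrt

p, n = 0.12, 500
print(p, round(sqrt(p * (1 - p) / n), 4))  # mu ~0.12, sigma ~0.0145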


Typically such problems extend further to also calculate a probability. Once we know the sampling distribution parameters (mean and variance), calculating the probability of any case follows naturally, as we have standard methods to derive probabilities from normal distributions. Only take care that the conditions are met for sample proportions ($np \geq 10$ and $nq \geq 10$) or sample means ($n > 30$). Since finding probability from a normal distribution is not yet covered, we skip that part. What one should be careful about is detecting whether the question has a sampling distribution associated with it, and tackling it accordingly. The same applies to the sample means we see next.

Sampling Distribution of Sample Means

This typically applies to cases where the variable under study is continuous. For example, a temperature distribution over a range of values.

Temperature

Setup

Random:

Let Y be the random variable indicating temperature over a distribution of certain values.

If the limiting values are, say, 0 deg C to 40 deg C, our population would look like this: $[23,13,35,30,10,2,5,0,33, \cdots ,21]$

Unlike sample proportions, we do not know or designate any proportion of temperatures in this example, but we can obtain the mean and variance by simply calculating over all values in the distribution. These are our population parameters.

Population mean $\mu = \mu_y$
Population variance $\sigma^2 = \sigma_y^2$

Since it is a random distribution, there is no theoretical calculation of these values, so we create such a population distribution programmatically and calculate its parameters.

In [76]:
%matplotlib inline
from math import floor
import matplotlib.pyplot as plt
from random import random, seed, shuffle
from SDSPSM import get_metrics, drawBarGraph, getPopulationStatistics

seed(0)

popMin = 1   # Min population
popMax = 40  # Max population
freqMax = 50 # max frequency of any population value (e.g., number of occurrences of temperature 25 deg C)

def createRandomPopulation(N):
    """
    Create a random distribution for N values
    """
    population = []
    population_freq = []
    for i in range(0,N):
        temp_freq = (floor(random() * freqMax))  # random frequency for each population
        temp_list = [i]*temp_freq
        population += temp_list
        population_freq.append(temp_freq)
    shuffle(population)
    return population, population_freq

population, population_freq = createRandomPopulation(popMax - popMin + 1)

#mu, var, sigma = get_metrics(population) # just to cross check..
#print(mu, var, sigma)  

N, mu, var, sigma = getPopulationStatistics(population_freq, popMin)

#visualize
backgroundColour = '#F5F5FF'
fig, (ax1) = plt.subplots(1,1, figsize=(16,3))
drawBarGraph(population_freq, ax1, [N, mu, var, sigma], 'Population Distribution','Temperature', 'Counts')
ax1.set_facecolor(backgroundColour)
fig.patch.set_facecolor(backgroundColour)
plt.show()

From the above graph, we have temperatures ranging from 1 deg C to 40 deg C, each with a random frequency. For example, 4 deg C occurs about 10 times, indicated by the height of the bar at Temperature = 4.

Summarizing, our population parameters:

$$ \color {blue}{ \mu = \mu_y = 20.64 \\ \sigma^2 = \sigma_y^2 = 128.17 \\ \sigma = \sigma_y = \sqrt{128.17} = 11.32 } \tag{4} $$

Let us visualize the density and PMF as usual..

In [77]:
def mini_plot_SDSM(raw_list, bins, width=0.5):

    ax1.hist(raw_list, bins) 
    ax1.set_title('Distribution')

    n, bins,_ = ax2.hist(raw_list, bins, density=True) 
    ax2.set_title('Density')

    # probability mass
    dummy_dict = {i:raw_list.count(i) for i in raw_list}
    total = sum(list(dummy_dict.values()))
    pmf = {key: round(val/total,4) for key, val in dummy_dict.items()}
    ax3.bar(list(pmf.keys()), list(pmf.values()), width=width)
    ax3.set_title('PMF')

fig, (ax1,ax2,ax3) = plt.subplots(1,3,figsize=(15,4))
mini_plot_SDSM(population, popMax, width=1)
plt.show()

Statistical Outcomes:

Let us randomly sample from the population. One trial/experiment is drawing, say, $n=10$ samples. Likewise, we repeat for many experiments, say $N=2000$.

Remember: We sample WITH REPLACEMENT

This step is same as we did in Sample proportions case

Steps:

  • 1.Take a random sample of $n=10$ values from the population. Example sample output: $\widehat{Y_1} = [33,20,40,1,22, \cdots ,37]$, denoting our 1st sample of size $n=10$. The hat (^) just indicates it's a statistical outcome.
  • 2.Take the mean of it and note down for kth experiment. $\overline{\widehat{Y_k}} = \dfrac {1}{n} \sum\limits_{i=1}^n Y_{ki} = \dfrac {1}{10} \sum\limits_{i=1}^{10} Y_{ki}$. Example sample output: $\overline{\widehat{Y_1}} = 11.34$
  • 3.Do steps 1 and 2 $N=2000$ times. That is, $N=2000$ experiments $\implies$ $2000$ mean values $\overline{\widehat{Y_k}}$, for $k = 1,2,3,\ldots,N$
    Example total output:
$\widehat{Y_k}$ $\overline{\widehat{Y_k}}$
1 33.2
2 22.1
3 20.4
4 40.2
5 11.3
... ...
2000 23
  • 4.Take the frequency count of the sample means and prepare a concise frequency chart, with each row pairing a mean value with its frequency.
    Example total output:
$\overline{\widehat{Y_j}}$   $n(\overline{\widehat{Y_j}})$
1 8
2 23
3 65
... ...
39 20
40 5
  • 5.Calculate the probability for each row of the above table by simply dividing the frequency by the total number of outcomes. Note the denominator sums to 2000 because, statistically, we get 1 outcome per experiment (which is a mean), so the total number of occurrences after 2000 experiments is 2000 outcomes (2000 means).
    $ p\Big(\overline{\widehat{Y_j}}\Big) = \dfrac {n\Big(\overline{\widehat{Y_j}}\Big)}{\sum n\Big(\overline{\widehat{Y_j}}\Big)} = \dfrac {n\Big(\overline{\widehat{Y_j}}\Big)}{2000} $
$\overline{\widehat{Y_j}}$   $n(\overline{\widehat{Y_j}})$   $p(\overline{\widehat{Y_j}})$
1 8 0.0040
2 23 0.0115
3 65 0.0325
... ... ...
39 20 0.0100
40 5 0.0025

From the above table, one could draw the probability mass function, a discrete probability distribution from which one can read off, for any $k$, the probability $p(X=k)$ discretely.

  • 6.Calculate the mean, variance and SD as below. Suppose there are M rows in total in the table. We would find that,
    $ \mu(\overline{\widehat{Y}}) = \overline{\overline{\widehat{Y}}} = \sum\limits_{j=1}^M \overline{\widehat{Y_j}}p(\overline{\widehat{Y_j}}) \approx \color {blue}{19.69}\\ \\ \sigma(\overline{\widehat{Y}})^2 = \sum\limits_{j=1}^M \Big(\overline{\widehat{Y_j}} - \mu(\overline{\widehat{Y}})\Big)^2p(\overline{\widehat{Y_j}}) \approx \color {blue}{12.4631}\\ \\ \therefore \ \ \sigma(\overline{\widehat{Y}}) \approx \sqrt{12.4631} = \color {blue}{3.5303} \tag {5} $

As usual, it's much simpler programmatically:

In [78]:
from random import choices

N = 2000 # No of experiments
n = 10   # Sample size

Y_hat = []
Y_mean_list = []
for each_experiment in range(N):  
    Y_hat = choices(population, k=n)   # sample with replacement
    Y_mean = sum(Y_hat)/len(Y_hat)
#     print(Y_mean)
    Y_mean_list.append(Y_mean)

mu, var, sigma = get_metrics(Y_mean_list)
print(mu, var, sigma)
19.6139 13.0506 3.6126

A quick visualization of our resultant statistical outcome Y_mean_list

In [79]:
import matplotlib.pyplot as plt

fig, (ax1,ax2,ax3) = plt.subplots(1,3,figsize=(15,4))

mini_plot_SDSM(Y_mean_list,popMax )
plt.show()

Notations and Conditions

You can already see the resultant sampling distribution shaping up as normal. A visual summary is provided below, where the normal distribution approximation is also applied.

Back to variables: the complicated statistical notation is typically written with simpler notation as follows. We used the former because we did not want to lose details for the sake of simplicity. Now that we have the insight, we can swap in these simpler conventional notations.

Note this is slightly different notation compared to Sample proportions case

$$ \color {blue}{ \mu_\overline{Y} = \mu(\overline{\widehat{Y}}) \\ \\ \sigma_\overline{Y} = \sigma(\overline{\widehat{Y}}) } $$

Comparing equations set $4$ and $5$,

$$ \color {blue} { \mu_\overline{Y} \approx 19.69 \approx 20.64 = \mu \\ \\ \sigma_\overline{Y} \approx 3.5303 \approx \dfrac{11.32}{\sqrt{10}} = \dfrac {\sigma}{\sqrt{n}} \tag {6} } $$

$\overline{Y}$ is called the sample mean. The resultant sampling distribution can be satisfactorily approximated by a normal density function when the sample size is sufficiently large.
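
A quick check of this relation against our simulation (the population $\sigma$ was 11.32, and get_metrics printed a simulated $\sigma$ of 3.6126):

In [ ]:
from math import sqrt

sigma_pop, n = 11.32, 10
print(round(sigma_pop / sqrt(n), 4))  # ~3.5797, close to the simulated 3.6126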

Conditions:

How large is sufficiently large? The general rule of thumb is $n > 30$. However, if the population distribution is already normal, then even if n is small, the resultant sampling distribution will be normal.

Visual Summary

In [80]:
%matplotlib inline
from SDSPSM import plot_SDSM

# PLOT THE RESULTS
from SDSPSM import plot_SDSP, get_normal_curve_label
backgroundColour = '#F5F5FF'
fig, axarr = plt.subplots(2,3, figsize=(15,10), facecolor=backgroundColour)

pop_axarr = [axarr[0,0],axarr[0,1], axarr[0,2]]
titles = ['Population Distribution','Population Discrete\nDensity Function','Population Probability Mass\nFunction (PMF)']
plot_SDSM(population, popMax, pop_axarr, titles, index_list=['A','B','C'], norm_off=True)

stat_axarr = [axarr[1,0],axarr[1,1], axarr[1,2]]
titles = ['Statistical Distribution','\nStatistical Discrete Density Function','Statistical Probability Mass\nFunction (PMF)']
plot_SDSM(Y_mean_list, popMax, stat_axarr, titles, index_list=['D','E','F'])

plt.tight_layout()
plt.show()

Again, recall that the normal approximation best fits not our discrete probability mass function F, but the discrete density function E. This is because a probability density function has total area 1 (total probability is 1), so the bars it is overlaid on must have total area 1 as well. That holds for the intermediate density function E, but not for our last derived discrete function F.

Examples

Example 1:

A machine automatically dispenses a beverage of a desired size. When set to small, the machine dispenses varying amounts of liquid with a mean of $275\ mL$ and standard deviation of $10\ mL$. Suppose that we take random samples of 5 of these drinks and calculate the mean amount of liquid $\bar{x}$ in each sample. We can assume that individual drinks are independent.

Calculate the mean and standard deviation of the sampling distribution of $\bar{x}$

Ans: The problem at hand is about varying amounts of liquid - this is a continuous random variable. Thus we treat it as sample means.

True population parameters: $\mu = 275\ mL, \ \ \sigma=10\ mL$. Sample size: $n = 5$.

$\therefore$ $$ \mu_{\overline{Y}} \approx \mu = 275 \ mL\\ \sigma_{\overline{Y}} \approx \dfrac {\sigma}{\sqrt{n}} = \dfrac {10}{\sqrt{5}} = 4.47 \ mL $$

Summary

[Image: summary of sampling distributions - sample proportions vs sample means]

Wish to do

  1. Theoretical Derivation for both Sample proportions and Sample means. blocking point
  2. Visualization for Conditions for Sample proportions and Sample means. blocking point

Z scores

Using Normal Distribution and Z Score

1. Using Normal Distribution

1.1 Using Empirical Rule (68 - 95 - 99.7%)

This is the simplest way of solving for a probability in a given problem. These we can typically do without a calculator, as they are very straightforward.

It simply means we assume that 68% of the data is within 1 standard deviation of the mean, 95% is within 2 standard deviations, and 99.7% is within 3 standard deviations.

[Image: the empirical rule - 68%, 95% and 99.7% of data within 1, 2 and 3 standard deviations of the mean]

Example:

The lifespans of lizards in a particular zoo are normally distributed. The average lizard lives 3.1 years; the standard deviation is 0.6 years. What is the probability of a lizard living less than 2.5 years?

Given:

$ \mu = 3.1 \\ \sigma = 0.6 \\ p(X < 2.5) = ? $

Approach:

Let us check if 2.5 is a whole multiple of the SD away from the mean:
$ \mu + a\sigma = 2.5 \\ 3.1 + a(0.6) = 2.5 \\ a(0.6) = -0.6 \\ a = -1 $

This exact multiple is an indication that we can right away use the empirical rule without having to integrate anything. Sometimes questions state this explicitly, and sometimes not, so we should check and apply accordingly. If they say "without using a calculator", that could be a hint as well.

In [2]:
import matplotlib.pyplot as plt
from normalviz import draw_normal

mu = 3.1
sigma = 0.6   

# plot
fig, ax = plt.subplots(1,1, figsize=(7,4))
draw_normal(ax, mu, sigma, 'x<2.5')  # question is x less than 2.5 years
ax.set_xlabel('No of years lizards live')
ax.set_ylabel('Probability that they \nlive those many years')
plt.show()

As you can see above, we just need to find the shaded area, and 2.5 is exactly at $\mu-\sigma$.

The total area of curve = 100% (as total probability is always 1)

Area from $\mu$ to $\infty$ = $p(X>\mu)$ = 50%
Area between $(\mu-\sigma)$ and $(\mu + \sigma)$ = $p\Big((\mu-\sigma)<X<(\mu+\sigma)\Big)$ = 68% $\ \ \because$ empirical rule

Thus,
Area between $(\mu-\sigma)$ and $\mu$ is $p\Big((\mu-\sigma)<X<\mu\Big)$ = $\dfrac {68}{2} = 34\%$

Now, we could easily calculate area of our interest as
$ p\Big( X < (\mu-\sigma) \Big) = 1 - \Bigg(p\Big((\mu-\sigma)<X<\mu\Big) + p(X>\mu)\Bigg) = 1 - \Big( 0.34 + 0.5 \Big) = 0.16 $

Thus the solution is 16%.
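
For comparison, the exact left-tail area can be computed from the normal CDF (a sketch assuming scipy is available; the empirical rule exists precisely so you can skip this step):

In [ ]:
from scipy.stats import norm

print(round(norm.cdf(2.5, 3.1, 0.6), 4))  # 0.1587; the empirical rule's 16% is this value, rounded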

1.2 Using Integrals

What if the number in question is not an exact multiple of the standard deviation away from the mean?

Example:

A set of middle school student heights are normally distributed with a mean of 150 cm and standard deviation of 20 cm. Darnell is a middle school student with a height of 161.4 cm. What proportion of student heights are lower than Darnell's height? Give your answer correct to four decimal places.

Given: $ \mu = 150 \ \ \ \ \sigma = 20 \ \ \ \ p(X<161.4)=? $

Approach:
$ \mu + a\sigma = 161.4 \\ 150 + a(20) = 161.4 \\ a(20) = 161.4 - 150 = 11.4 \\ a = \frac {11.4}{20} = 0.57 $

So Darnell is $0.57\sigma$ away from the mean, not an exact multiple of sigma.

In [4]:
mu = 150
sigma = 20

# plot
fig, ax = plt.subplots(1,1, figsize=(7,4))
draw_normal(ax, mu, sigma, 'x<161.4')  # question is x less than 161.4 cm
ax.set_xlabel('Student heights (cm)')
ax.set_ylabel('Probability density \nof heights')
plt.show()

Remember, a normal function is given by,

$ p(X) = \dfrac {1}{\sigma\sqrt{2\pi}}e^{-\dfrac{1}{2} \Big( \dfrac{X-\mu}{\sigma} \Big)^2 } $

So we need to integrate to calculate the area of interest as shown above.

$ \displaystyle { p(X<161.4) = \int_{-\infty}^{161.4} \dfrac {1}{\sigma\sqrt{2\pi}}e^{-\dfrac{1}{2} \Big( \dfrac{X-\mu}{\sigma} \Big)^2 } dx \ = \dfrac {1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{161.4} e^{-\dfrac{1}{2} \Big( \dfrac{X-\mu}{\sigma} \Big)^2 } dx \ = \dfrac {1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{161.4} e^{-\dfrac{1}{2} \Big( \dfrac{X-150}{20} \Big)^2 } dx \\ \\ } $

This integral has no simple closed form, and that is why we have something called the z-score. But before that, we will leave the hard way of integrating to Python and see what it churns out.

In [14]:
from scipy.integrate import quad
import numpy as np
from math import sqrt, pi

def pdf(x):
    C = 1/(sigma*sqrt(2*pi))
    E = -(((x-mu)/sigma)**2)/2
    G = C*np.exp(E)
    return G

xstart = mu - 4*sigma
xend = mu + 4*sigma
x = np.linspace(xstart, xend, 100)

# Integrate PDF from -inf to x
result, _ = quad(pdf,  -np.inf, 161.4, limit = 1000)
print(round(result,4))
0.7157

Cool! The answer is $\color {blue}{0.7157}$ or 71.57%

2. Z Score

2.1 Z Score to easily evaluate area

In earlier days, one had to painfully integrate the above example, and that too for every such problem! If you remember, our normal approximation function is a standard Gaussian function shifted by the mean and scaled by sigma as required to fit our probability distribution.

This means, we could transform every such normal distribution back to the standard normal distribution. If we have all possible areas calculated (well, as much as possible), we could simply refer to those values, because the areas would remain the same.

The area information we have for the standard normal distribution is called the Z table. Note that when we shift the curve, we also need to shift the area of interest accordingly. This is where we calculate something called the Z-score, a standardized value for how far our point of interest is from the mean. You will understand that in a minute.

This is our current normal approximation function.

$ p(X) = \dfrac {1}{\sigma\sqrt{2\pi}}e^{-\dfrac{1}{2} \Big( \dfrac{X-\mu}{\sigma} \Big)^2 } $

Let $Z = \dfrac {X-\mu}{\sigma}$, then

$ p(Z) = \dfrac {1}{\sigma\sqrt{2\pi}}e^{-\dfrac{1}{2} \Big( Z \Big)^2 } $

When, $X = \mu$, $Z = \dfrac {X-\mu}{\sigma} = 0$
When, $X = \mu + \sigma$, $Z = \dfrac {X-\mu}{\sigma} = \dfrac {\mu + \sigma - \mu}{\sigma} = 1$

Thus, on the Z scale, $\mu = 0$ and $\sigma = 1$. One unit on the Z scale indicates $1\sigma$ on the X scale. Any normal distribution can be transformed to this standard form, thus facilitating easy area calculation.

Below is how our standard function looks.

In [17]:
def normal_pdf(z):
    C = 1/(sqrt(2*pi))
    E = -(z**2)/2
    G = C*np.exp(E)
    return G

mu = 0
sigma = 1
zstart = mu - 4*sigma
zend = mu + 4*sigma
z = np.linspace(zstart, zend, 100)

# plot
fig, ax = plt.subplots(1,1, figsize=(7,4))
draw_normal(ax, mu, sigma)  
ax.set_xlabel('Z')
ax.set_ylabel('Probability density')
plt.title('Standard Normal Distribution',y=1.1)
plt.show()

Oh Wait! What about our area of interest? We have not done that yet.

When $X = 161.4 \ \ Z = \dfrac {X-\mu}{\sigma} = \dfrac {161.4-150}{20} = 0.57$

Applying that..

In [18]:
# plot
fig, ax = plt.subplots(1,1, figsize=(7,4))
draw_normal(ax, mu, sigma, 'x<0.57')   # x here is already on the Z scale; mu=0 and sigma=1 handle the shift and scale
ax.set_xlabel('Z')
ax.set_ylabel('Probability density')
plt.title('Standard Normal Distribution',y=1.1)
plt.show()

Now we just have to find the area shown above, that is, $p(Z \leq 0.57)$.

Let us check the Z table. The value in the table always indicates the area below the limit value we are looking for. Here, we would get the area below $Z=0.57$.

For brevity, I have shown only a part of the table, but such tables can easily be found online, like the one here.

[Image: excerpt of a Z table showing the entry for z = 0.57]

The answer is 0.7157 again!! This is exactly what we got via the integral earlier!
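
The same lookup can be scripted; a one-liner assuming scipy is available (its norm.cdf is effectively a programmatic Z table):

In [ ]:
from scipy.stats import norm

print(round(norm.cdf(0.57), 4))  # 0.7157, matching both the Z table and the integral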

So is that all there is to Z-scores? No, they are also useful for comparing different probability distributions.

By the way, the value we arrived at, Z = 0.57, is called the Z-score, in this case Darnell's Z-score.


Given the mean $\mu$, standard deviation $\sigma$ and point of interest X, the Z-score can be calculated as $Z=\dfrac{X-\mu}{\sigma}$


2.2 Z Score to Compare

Example:

Juwan took 2 exams. He scored 172 on the LSAT and 37 on the MCAT. Which exam did he do relatively better on? Given below are the probability distributions for LSAT and MCAT.

Exam Mean Standard deviation
LSAT 151 10
MCAT 25.1 6.4

We are not asked for the proportion of students below or above Juwan's LSAT or MCAT score. We are asked to compare Juwan's own scores. But the two exams are on different scales.

The key idea is this:

By standardizing Juwan's LSAT and MCAT scores to the standard normal distribution, we bring the different distributions to the same scale. Now we can compare his two Z-scores.

And note, we do not even have to calculate an area for this problem. Even if we needed to, it's not a big deal, as we know how to calculate areas from the previous examples.

Given: $ \mu_x = 151, \ \sigma_x = 10 \\ \mu_y = 25.1, \ \sigma_y = 6.4 \\ X = 172, Y = 37 $

Approach: Calculate Z scores
$ Z_x = \dfrac {172-151}{10} = 2.1 \\ Z_y = \dfrac {37-25.1}{6.4} = 1.86 \\ $

Ans:
$Z_x > Z_y$, so Juwan did relatively better on the LSAT.

Also note: as long as we have a mean and standard deviation, this transformation to a Z-score is possible and valid. The distributions need not always be normal.

Sum of Dice

Calculating sum of dice outcomes

Sum of Dice

We had earlier seen the graphs for dice with outcomes = {0,1,2}, with 2 or 3 flips. What happens when the number of flips is large? One could guess it converges to a normal distribution curve, but we want to see that in action. This poses a unique challenge: the number of possible outcomes even for a moderate number of flips (e.g., 20) is very large, and computing them all takes significant time. In this article, we try to solve that.

In [30]:
from anyflipviz import get_combinations_consolidated, plot_combinations_consolidated

final_df = get_combinations_consolidated(n_outcomes = 3, n_flips=2)
plot_combinations_consolidated(final_df)

Slow Algorithm

The snippet below is an implementation of the slow algorithm we have been using so far. Note the approach is to generate every possible combination and then simply count them.

Let,

'n' = no of dice or no of tosses (mathematically both are equal)
's' = no of sides or no of outcomes
'p' = the total sum for which we want to calculate the number of occurrences (in the graph this is denoted by X)

In [2]:
# https://stackoverflow.com/questions/40313058/all-permutations-of-dice-with-a-set-range-and-sum
import itertools
def subsets(n,p,s):
     perms = itertools.product(s, repeat=n)
     for i in perms:
         if sum(i) == p:
             yield i

No of tosses = 3, outcomes per dice = {0,1,2}

In [3]:
n = 3 # no of tosses or no of dices

s = range(0,n) # 0,1,2
p_min = min(s)*n
p_max = max(s)*n
p = range(p_min, p_max+1)  # 0,1,2,...,6; p is not probability but the sum, letter p indicating 'points' of dice

result_coeff = []
for i in p:
    result = list(subsets(n,i,s))
    print(result)
    result_coeff.append(len(result))
print(result_coeff)
[(0, 0, 0)]
[(0, 0, 1), (0, 1, 0), (1, 0, 0)]
[(0, 0, 2), (0, 1, 1), (0, 2, 0), (1, 0, 1), (1, 1, 0), (2, 0, 0)]
[(0, 1, 2), (0, 2, 1), (1, 0, 2), (1, 1, 1), (1, 2, 0), (2, 0, 1), (2, 1, 0)]
[(0, 2, 2), (1, 1, 2), (1, 2, 1), (2, 0, 2), (2, 1, 1), (2, 2, 0)]
[(1, 2, 2), (2, 1, 2), (2, 2, 1)]
[(2, 2, 2)]
[1, 3, 6, 7, 6, 3, 1]

No of tosses = 3, outcomes per dice = {1,2,3}

In [4]:
n = 3  # no of flips or no of dices

s = range(1,n+1) # s-sided = 1,2,3
p_min = min(s)*n
p_max = max(s)*n
p = range(p_min, p_max+1)  # 3,4,...,9; p is not probability but the sum, letter p indicating 'points' of dice

result_coeff = []
for i in p:
    result = list(subsets(n,i,s))
    print(result)
    result_coeff.append(len(result))
print(result_coeff)
[(1, 1, 1)]
[(1, 1, 2), (1, 2, 1), (2, 1, 1)]
[(1, 1, 3), (1, 2, 2), (1, 3, 1), (2, 1, 2), (2, 2, 1), (3, 1, 1)]
[(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 2, 2), (2, 3, 1), (3, 1, 2), (3, 2, 1)]
[(1, 3, 3), (2, 2, 3), (2, 3, 2), (3, 1, 3), (3, 2, 2), (3, 3, 1)]
[(2, 3, 3), (3, 2, 3), (3, 3, 2)]
[(3, 3, 3)]
[1, 3, 6, 7, 6, 3, 1]

Note that when the outcomes = {1,2,3,...} start with a non-zero value like 1 here, the minimum sum is min(s)*n; that is, if the number of flips = 3, the minimum sum = 1+1+1 = 3, so our X values start from there instead of 0 as in the earlier cases. Note this shifts the graph along the X axis.

Faster Algorithm

Fortunately we have a formula to make use of, given below. It assumes the outcomes start at 1; that is, if s = 3, they would be {1,2,3} and not {0,1,2}.

The probability that n s-sided dice have sum p is

$$ \color {blue} {P(p,n,s) = \dfrac {1}{s^n} \sum\limits_{k=0}^{k_{max}} (-1)^k \binom {n}{k} \binom {p-sk-1}{n-1} } $$

where,

$$ \color {blue} {k_{max} = \Big\lfloor \dfrac {p-n}{s}\Big\rfloor } $$ where $\lfloor \cdot \rfloor$ is the floor function. If we are interested only in the number of favourable outcomes, we can ignore $\frac {1}{s^n}$, which is just used to convert counts to probability (which is also what we did in our implementation below).

For more details about the formula, please check here

In [5]:
def ncr(n, r):
    """
    Calculate n choose r
    https://stackoverflow.com/questions/4941753/is-there-a-math-ncr-function-in-python
    """
    import operator as op
    from functools import reduce
    r = min(r, n-r)
    numer = reduce(op.mul, range(n, n-r, -1), 1)
    denom = reduce(op.mul, range(1, r+1), 1)
    return numer//denom

def subsets_turbo(n,p,s):
    """
    Finding no of combinations whose sum is p, for s-sided dice for n-no of flips (or n dices) using formula
    http://mathworld.wolfram.com/Dice.html
    """
    from math import floor
    k_max = floor((p-n)/s)

    # summation
    result = 0
    for i in range(0, k_max+1):  # k_max inclusive..

        C_1 = (-1)**i
        C_2 = ncr(n, i)
        C_3 = ncr(p-s*i-1,n-1)
        C_4 = C_1*C_2*C_3
        result += C_4

    return result

# A simple test 

n = 3 # no of flips
s = range(1,n+1) # s-sided = 1,2,3
n_s = len(s)
p_min = min(s)*n
p_max = max(s)*n
p = range(p_min, p_max+1)  # 3,4,...,9; p is not probability but the sum, letter p indicating 'points' of dice

result_coeff = []
for i in p:
    result = subsets_turbo(n,i,n_s)
    result_coeff.append(result)
print(result_coeff)
[1, 3, 6, 7, 6, 3, 1]

Converting to our standard pandas dataframe to use our older graphing methods..

In [26]:
import pandas as pd

def get_combinations_consolidated_turbo(n_outcomes = 3, n_flips = 3):
    """
    Given the  no of outcomes per flip, no of flips, and probability of each sum 
    this will provide dataframe: x, n(x) and p(x)
    WITHOUT CALCULATING INDIVIDUAL COMBINATIONS
    Formula: http://mathworld.wolfram.com/Dice.html
    """
    n = n_flips # no of flips

    s = range(1,n_outcomes+1) # s-sided = 1,2,3

    n_s = len(s)
    p_min = min(s)*n
    p_max = max(s)*n
    p = range(p_min, p_max+1)  # all possible sums p_min..p_max; p is not probability but the sum, letter p indicating 'points' of dice

    try:
        X = p
        N = [subsets_turbo(n,i,n_s) for i in p]
        df = pd.DataFrame({'x': X, 'n(x)': N}, dtype='object') #dtype needed to avoid pandas overflow error
        total_outcomes = df['n(x)'].sum()
        df['p(x)'] = df['n(x)']/total_outcomes
        df = df[['x','n(x)','p(x)']]
        return df
    except Exception as e:
        e = str(e)
        print("X:{} Max(N):{} len(N):{} Unexpected error:{}".format(X, max(N), len(N), e))
        raise

final_df = get_combinations_consolidated_turbo(3,3)
plot_combinations_consolidated(final_df)

No of tosses = 50, outcome = {1,2,3}

Now we are up to the challenge.

In [29]:
n_flips = 50  # or also indicate no of dices
n_outcomes = 3  # {1,2,3}

final_df = get_combinations_consolidated_turbo(n_outcomes,n_flips)
plot_combinations_consolidated(final_df,label=False)

Random toss

We will also implement a simulation of tossing a given dice with a given number of outcomes, n times. If we consider this as one experiment, then each experiment can give only one output out of all possible final sequences. So we will repeat this experiment a fixed number of times, to see how all the outcomes stack up.

In [44]:
# statistical toss functions.. 
from random import choice, seed
import numpy as np
seed(0)  # just for consistent result every time..   
def toss(n_toss, n_outcomes):
    """
    Toss given dice with n_outcomes, for n_toss times. Each outcome has equal probability.
    Thus a single toss gives a uniform distribution. 
    """

    final_sequence = []

    s = range(1,n_outcomes+1) # s-sided = 1,2,3    
    n_s = len(s)

    for i in range(n_toss):  # 0 to (n_toss-1) times..
        toss_result = np.random.choice(s) # uniform distribution assumed
        final_sequence.append(toss_result)
        # print(toss_result)
    return sum(final_sequence)

def sample(n_experiments, n_toss, n_outcomes):
    """
    Conduct experiment given no of times
    In each experiemnt, toss given no of times, and update n_X
    """
    from collections import defaultdict
    samples = defaultdict(lambda: 0)
    for each_experiment in range(0, n_experiments):
        X = toss(n_toss, n_outcomes)  # X is sum of outcome sequence of n_toss        
        samples[X] += 1   # constructing n(X)
        #print(each_experiment, dict(samples))

    # convert to pandas
    df = pd.DataFrame([[key,value] for key,value in samples.items()],columns=['x','n(x)'])
    df.sort_values('x', inplace=True)
    total_outcomes = df['n(x)'].sum()
    df['p(x)'] = df['n(x)']/total_outcomes
    return df

n_experiments = 2000
n_toss = 3
n_outcomes  = 3  # or no of sides for the dice
temp_df = sample(n_experiments, n_toss, n_outcomes)
plot_combinations_consolidated(temp_df)

Theoretical comparison:

Let us compare this with our equivalent theoretical outcome..

In [48]:
n_toss = 3  # or also indicate no of dices
n_outcomes = 3  # {1,2,3}

final_df = get_combinations_consolidated_turbo(n_outcomes,n_toss)
final_df['p(x)'] = final_df['p(x)'].apply(lambda x: round(x,5))
plot_combinations_consolidated(final_df)

Note how close our probability distributions (RHS) are for the theoretical and statistical experiments.